Job Management Requirements for NAS Parallel Systems and Clusters
Abstract
A job management system is a critical component of a production supercomputing environment, permitting oversubscribed resources to be shared fairly and efficiently. Job management systems that were originally designed for traditional vector supercomputers are not appropriate for the distributed-memory parallel supercomputers that are becoming increasingly important in the high performance computing industry. Newer job management systems offer new functionality but do not solve fundamental problems. We address some of the main issues in resource allocation and job scheduling we have encountered on two parallel computers — a 160-node IBM SP2 and a cluster of 20 high performance workstations located at the Numerical Aerodynamic Simulation facility. We describe the requirements for resource allocation and job management that are necessary to provide a production supercomputing environment on these machines, prioritizing according to difficulty and importance, and advocating a return to fundamental issues.
Related resources
Novel HPC Technologies for Scalable CAE: The Case for Parallel I/O and File Systems
As HPC continues its aggressive platform migration from proprietary supercomputers and Unix servers to HPC clusters, expectations grow for clusters to meet the I/O demands of increasing fidelity in CAE modeling and data management in the CAE workflow. Cluster deployments have increased as organizations seek ways to cost-effectively grow compute resources for CAE applications, and during this mig...
ACL 2 for Parallel Systems Software: A Progress Report
A significant development in high-performance computing has occurred in recent years with the proliferation of “Beowulf” clusters [6]. Beowulf clusters are parallel computers assembled from commodity-priced personal computers and networks. The explosive growth of the personal computer marketplace, together with rapid technological advances in the hardware sold there, has driven the price/perfor...
A Comparison of Workload Traces from Two Production Parallel Machines
The analysis of workload traces from real production parallel machines can aid a wide variety of parallel processing research, providing a realistic basis for experimentation in the management of resources over an entire workload. We analyze a five-month workload trace of an Intel Paragon machine supporting a production parallel workload at the San Diego Supercomputer Center (SDSC), comparing and...
Object Storage: Scalable Bandwidth for HPC Clusters
This paper describes the Object Storage Architecture solution for cost-effective, high bandwidth storage in High Performance Computing (HPC) environments. An HPC environment requires a storage system to scale to very large sizes and performance without sacrificing cost-effectiveness or ease of sharing and managing data. Traditional storage solutions, including disk-per-node, Storage-Area Netwo...
JAMILA: A Usable Batch Job Management System to Coordinate Heterogeneous Clusters and Diverse Applications over Grid or Cloud Infrastructure
Usability is an important feature of Grids or Clouds to end users, who may not be computer professionals but need to use massive machines to compute their jobs. For meeting various computing or management requirements, heterogeneous clusters with diverse Distributed Resource Management Systems (D-RMS) and applications are needed to supply computing services in Grids or Clouds. The heterogeneity...